1 00:00:10,160 --> 00:00:14,560 Good afternoon, everybody. It's lovely to see you all here today, joining us for the second 2 00:00:14,560 --> 00:00:20,400 session in our series with the Internet Archive, DWeb and Library Futures. The entire series is 3 00:00:20,400 --> 00:00:25,440 titled "Imagining a Better Online World: Exploring the Decentralized Web," and today we'll be talking 4 00:00:25,440 --> 00:00:30,880 about using decentralized storage to keep your materials safe. My name is Davis Erin Anderson. I'm 5 00:00:30,880 --> 00:00:36,240 assistant director for programs and partnerships at METRO. Please say hello in chat. We'd love to 6 00:00:36,240 --> 00:00:41,120 know who's out there. Where are you from, what's your name, and what's your interest in this topic? 7 00:00:41,120 --> 00:00:44,320 We'd love to know who's in the audience and hear from you a little bit as we get started. 8 00:00:44,880 --> 00:00:50,240 METRO is a multi-type consortium. We serve the five boroughs of New York City and Westchester 9 00:00:50,240 --> 00:00:54,880 County. We're a service provider: we do events like this and partnership programs like the one you're 10 00:00:54,880 --> 00:00:59,920 attending today. We have a group that works on software development. We provide delivery 11 00:00:59,920 --> 00:01:04,080 services and make sure that knowledge can be spread equitably throughout our service area, 12 00:01:04,080 --> 00:01:08,880 so we really care a lot about the future of how information moves. And so we are pleased 13 00:01:08,880 --> 00:01:13,600 and honored today to support the work the folks are doing at Internet Archive. We wanted to 14 00:01:13,600 --> 00:01:18,160 hear a little bit more about what they envision for the future of the web, so we're running a 15 00:01:18,160 --> 00:01:23,520 six-part series. This is the second part. 
We'll drop a link into chat that lets you see where 16 00:01:23,520 --> 00:01:28,080 to go to register for upcoming sessions as well. Check back on the resources we're providing 17 00:01:28,080 --> 00:01:33,440 for this one and the past sessions as well. If you would, please drop your questions 18 00:01:33,440 --> 00:01:38,080 into chat, and your comments as well. We had a really robust and active conversation going for 19 00:01:38,080 --> 00:01:42,560 our first session and we'd love to see that happen here again. We're also providing resource guides to 20 00:01:42,560 --> 00:01:48,240 go with each and every one of our six parts of the series, so please also look at chat for a link to 21 00:01:48,240 --> 00:01:53,680 the current guide and please stay tuned to your inbox. If you're registered, you'll receive a PDF copy 22 00:01:53,680 --> 00:01:59,520 as well. It's my pleasure to introduce you to Wendy Hanamura. Wendy is Director of Partnerships 23 00:01:59,520 --> 00:02:04,400 at Internet Archive. She planned the first Decentralized Web Summit a few years back. 24 00:02:05,040 --> 00:02:09,200 In the past six years, she's helped to guide the global growth of the decentralized web, so she's 25 00:02:09,200 --> 00:02:14,480 really the expert on this topic. And she's here today to co-produce the six-part series 26 00:02:14,480 --> 00:02:18,640 Imagining a Better Online World: Exploring the Decentralized Web. Thank you so much, Wendy. Over to you. 27 00:02:18,640 --> 00:02:23,040 Thank you, Davis, and thanks to all of you for being 28 00:02:23,040 --> 00:02:28,480 here today. I'm seeing friends from Berlin and Argentina, many, many from New York and Florida. 29 00:02:28,480 --> 00:02:34,160 We're so happy that you can be here to learn a little bit about decentralized storage. 
Now, in this 30 00:02:34,160 --> 00:02:40,640 webinar we're going to be exploring with you a new set of decentralized technologies that may help 31 00:02:40,640 --> 00:02:46,560 you to preserve and provide access to your media. So here's the game plan: for the next 60 minutes, 32 00:02:47,120 --> 00:02:52,160 I'm going to start by giving us an overview of some of the problems that decentralized storage 33 00:02:52,160 --> 00:02:57,600 could help to solve. Then I have invited a friend of mine, the founder of Starling Lab, 34 00:02:57,600 --> 00:03:03,520 to share with you how his group is working with many, many cultural institutions to keep their most 35 00:03:03,520 --> 00:03:09,440 critical and important materials safe. We also want to show you this tech in action, so I've 36 00:03:09,440 --> 00:03:15,840 invited two people to demonstrate what they've been working on. First, an engineer of ours from 37 00:03:15,840 --> 00:03:21,600 the Internet Archive is going to be showing you how we've been experimenting with saving web archives 38 00:03:21,600 --> 00:03:28,720 at scale in Filecoin, and a senior engineer from Storj, the decentralized storage company, is 39 00:03:28,720 --> 00:03:34,960 going to show you how we've been storing LibriVox audio books in decentralized storage. Now, both of 40 00:03:34,960 --> 00:03:41,600 these collections, the web archives and the audio books, were created collaboratively by communities, 41 00:03:41,600 --> 00:03:47,120 and I think that's the real promise here: that you could take collaborative collections and perhaps 42 00:03:47,120 --> 00:03:52,560 store them and preserve them collaboratively as well. So let's start by thinking about some 43 00:03:52,560 --> 00:03:58,480 of the challenges. Now, many of you are archivists, you're librarians, you run cultural institutions. So 44 00:03:58,480 --> 00:04:06,320 this is very familiar. 
Your collections are ever expanding in the physical world but also 45 00:04:06,320 --> 00:04:12,960 in the digital realm. Digital objects may be even harder to store, right? How do you 46 00:04:12,960 --> 00:04:19,600 keep things safe, not only from floods and fires, but also secure from hackers? How do you make them 47 00:04:19,600 --> 00:04:25,760 accessible in a time when there are broken links and content drift? How do you make sure that your 48 00:04:25,760 --> 00:04:34,160 data is trustworthy, especially in an era when deep fakes are growing? Then there's the scale 49 00:04:34,160 --> 00:04:40,400 of digital holdings, which seem to be enormous, and isn't it true that weeding digital objects 50 00:04:40,400 --> 00:04:46,880 feels a little bit wrong since they're just bits? How do you weed an ever-growing digital collection? 51 00:04:48,160 --> 00:04:54,320 And what about the long-term preservation, the sustainability of this collection? How do you 52 00:04:54,320 --> 00:05:01,280 do digital storage in centuries? And let's not forget the issue of cost. It is so hard 53 00:05:01,280 --> 00:05:07,520 to predict the future costs of decentralized storage, especially when technology is changing 54 00:05:07,520 --> 00:05:16,080 all the time. Now, that takes us to this: think of the decentralized web as a stack, with every layer 55 00:05:16,080 --> 00:05:22,480 of the web stack potentially decentralized. When you take all of these decentralized technologies 56 00:05:22,480 --> 00:05:27,920 together, that's what we call the decentralized web, and you'll notice in this diagram that the bottom 57 00:05:27,920 --> 00:05:34,800 layer is decentralized storage. That's the layer we're going to be exploring today. Conceptually, 58 00:05:34,800 --> 00:05:41,440 decentralized storage allows you to store your data across a peer-to-peer network of servers, 59 00:05:42,080 --> 00:05:47,600 but so does Amazon cloud, right? 
So what's the difference? I would say that the difference here 60 00:05:47,600 --> 00:05:55,360 is really that not only is your storage location distributed but also your storage management is 61 00:05:55,360 --> 00:06:01,920 decentralized. That way you can't take out just one central control entity like Amazon and have 62 00:06:01,920 --> 00:06:08,640 the entire system go down. So what is the promise? What does decentralized storage offer? Well, first 63 00:06:08,640 --> 00:06:14,160 there's the concept of resiliency. Now we're very familiar with that in the library world: there's 64 00:06:14,160 --> 00:06:20,080 LOCKSS: lots of copies keep things safe. So we know that if you distribute copies across different 65 00:06:20,080 --> 00:06:26,960 geographic lines, geopolitical lines, it's going to be safer. Then there's the concept of persistence. 66 00:06:26,960 --> 00:06:32,000 Now this is something that a lot of people get wrong when they think about the decentralized web. 67 00:06:32,000 --> 00:06:38,560 Just because you cut up a file and put pieces of it in different servers does not mean that those 68 00:06:38,560 --> 00:06:44,880 servers are guaranteed to keep your files forever. Now persistence would mean that you'd have to have 69 00:06:44,880 --> 00:06:51,680 a guarantee somehow built in that the people who hold your copies will hold them forever, or for 70 00:06:51,680 --> 00:06:57,840 a long time. So how do you ensure persistence? Well, in truth, I don't think we're really sure 71 00:06:57,840 --> 00:07:04,640 about that, but organizations like Filecoin and Storj are using a combination of incentives and 72 00:07:04,640 --> 00:07:12,000 shared protocols and contracts to try to ensure persistence. Next, I think this 73 00:07:12,000 --> 00:07:19,280 step, self-certification, is the most important attribute of decentralized storage. 
74 00:07:19,280 --> 00:07:25,440 Here, every item is assigned a unique immutable hash, a persistent ID, 75 00:07:26,400 --> 00:07:32,160 and you use this ID to find your things, wherever they are, and 76 00:07:32,160 --> 00:07:40,480 to check how many people have copies of them. This is something we call content addressing. 77 00:07:40,480 --> 00:07:44,400 In Web 2.0 you find things based on where they're located. You have a 78 00:07:44,400 --> 00:07:49,600 URL that takes you to a place on a server. Well, in Web 3.0, or the decentralized web, 79 00:07:50,320 --> 00:07:58,000 the ID remains with the content itself. And if the content changes, so does the hash. So anytime 80 00:07:58,000 --> 00:08:03,840 something is altered, you get a new hash. And ostensibly the self-certification is what allows 81 00:08:03,840 --> 00:08:12,320 you to ensure the provenance and authenticate an item. Finally, there is the goal of interoperability. 82 00:08:12,320 --> 00:08:18,960 I think it's pretty true that right now we have a lot of silos where our materials live, and when you 83 00:08:18,960 --> 00:08:25,520 want to work collaboratively on a shared data set, that can be very problematic. Now, in the utopian 84 00:08:25,520 --> 00:08:32,160 version of decentralized storage, you can have collaborative, authenticated, co-hosted collections, 85 00:08:32,720 --> 00:08:37,920 and these collections would be less prone to censorship, because you can't block just one 86 00:08:37,920 --> 00:08:44,560 URL and block the entire collection. They're also perhaps harder to hack because there's not one 87 00:08:44,560 --> 00:08:52,240 single honeypot to go after. They may be easier to share. Taken together, resiliency, persistence, 88 00:08:52,240 --> 00:08:57,920 self-certification, and interoperability -- that is the promise of decentralized storage. 89 00:08:57,920 --> 00:09:02,960 But it is still early days. 
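[Editor's illustration] The content-addressing idea described above, where the ID is derived from the content and any alteration yields a new hash, can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: SHA-256 stands in for whatever hash a real network uses, and `content_id` and `ContentStore` are hypothetical names for this example, not part of IPFS, Filecoin, or Storj.

```python
import hashlib


def content_id(data: bytes) -> str:
    """Derive an identifier from the bytes themselves (SHA-256, hex-encoded)."""
    return hashlib.sha256(data).hexdigest()


class ContentStore:
    """Toy in-memory store: items are found by their hash, not by a location."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        cid = content_id(data)
        self._blobs[cid] = data
        return cid

    def get(self, cid: str) -> bytes:
        data = self._blobs[cid]
        # Self-certification: re-hash on the way out and compare with the ID.
        if content_id(data) != cid:
            raise ValueError("stored bytes do not match their identifier")
        return data


store = ContentStore()
original = b"LibriVox recording, chapter 1"
altered = b"LibriVox recording, chapter 1 (edited)"

cid_original = store.put(original)
cid_altered = store.put(altered)

# Changing the content changes the identifier.
print(cid_original != cid_altered)
```

Because the identifier is computed from the bytes, anyone holding a copy can re-check it without trusting the server it came from, which is the property the speaker calls self-certification.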
So whether or not we can deliver on those things is something we're 90 00:09:02,960 --> 00:09:09,520 testing now. It is my deep pleasure to bring on Jonathan Dotan. He's the founder of Starling Lab, 91 00:09:10,080 --> 00:09:15,200 which is the first major research laboratory devoted to Web 3 technologies. It's affiliated 92 00:09:15,200 --> 00:09:21,760 with Stanford and USC, and I know that Starling has been working for quite a while with the Shoah 93 00:09:21,760 --> 00:09:28,240 Foundation to make sure that Holocaust testimony videos are kept safe and persistent. But here's a 94 00:09:28,240 --> 00:09:34,720 fun fact: I first met Jonathan Dotan back in 2018 when he was the consultant for HBO's Silicon 95 00:09:34,720 --> 00:09:41,280 Valley, and it was Jonathan Dotan who convinced the showrunners to introduce a storyline about a 96 00:09:41,280 --> 00:09:46,720 new internet, a decentralized internet. And that's how he came to be involved with us at the DWeb 97 00:09:46,720 --> 00:09:50,367 community. Welcome, Jonathan Dotan, founder of Starling Lab. 98 00:09:50,834 --> 00:09:57,120 Thanks so much, Wendy, for having me, and to the entire community that's assembled here. I can't think of a more appropriate group 99 00:09:57,120 --> 00:10:02,240 of folks to be speaking to about decentralized storage, because certainly the power of archiving 100 00:10:02,240 --> 00:10:10,800 institutions and libraries, and providing a new layer of trust for communities in preservation, 101 00:10:10,800 --> 00:10:16,720 is unique. And I'm really excited to help bring you into the fold, to help answer any questions and 102 00:10:16,720 --> 00:10:19,605 potentially inspire you on the possibilities. 
103 00:10:20,242 --> 00:10:21,440 At the Starling Lab, we've been working on what we 104 00:10:21,440 --> 00:10:25,760 call a framework for data integrity that allows you, end to end, to think about how 105 00:10:25,760 --> 00:10:33,440 you capture, store, and verify information. And the page that we really are working from here 106 00:10:33,440 --> 00:10:39,760 is one that was written many years ago. So I want to start with a little bit of context today, 107 00:10:39,760 --> 00:10:45,280 share with you a prototype of some of the early work that we've done, and then get into 108 00:10:45,280 --> 00:10:49,280 some of the learnings and how they might apply over to you and some of your archival use cases. 109 00:10:50,480 --> 00:10:55,600 So, to begin: Wendy's talked a little bit about the goals of decentralization, but 110 00:10:55,600 --> 00:11:00,880 I want to start even a little bit more upstream from there and just get into a very simple but 111 00:11:00,880 --> 00:11:07,040 also, I think, poignant understanding of how decentralization works. In the prior 112 00:11:07,040 --> 00:11:13,280 view of communication systems, when -- let's say AT&T as an example was a dominant form of guaranteeing 113 00:11:14,080 --> 00:11:20,400 communications infrastructure -- you had to rely on AT&T as a centralized node. Basically, all the 114 00:11:20,400 --> 00:11:25,920 data went into AT&T, and then went out of AT&T in order to be passed along, in the case of something 115 00:11:25,920 --> 00:11:32,720 like voice communications. The distributed web was something that was not actually done, let's say, in 116 00:11:32,720 --> 00:11:37,760 the 90s, when the web took off. It actually was a part of the original architecture of the internet. 
117 00:11:38,320 --> 00:11:43,440 And when Paul Baran over at the RAND Institute set forth how a digital network of this kind 118 00:11:43,440 --> 00:11:49,040 might work, he explained that what was critical was to establish nodes that were interoperable. 119 00:11:49,040 --> 00:11:53,680 So that meant that if any of those nodes went down they could be replaced and functioned by 120 00:11:53,680 --> 00:11:57,760 the others; that there was neutrality, that the network itself didn't discriminate, but simply 121 00:11:57,760 --> 00:12:03,520 passed along the information. And that meant that you could allow the end user to hold 122 00:12:03,520 --> 00:12:08,560 all of the intelligence and the rich applications that might exist on this type of network. 123 00:12:09,280 --> 00:12:15,760 And so this concept was really baked into a larger philosophical framework that was put into 124 00:12:15,760 --> 00:12:20,400 play by Doug Engelbart and the team over at the Augmented Intelligence Framework, and what it 125 00:12:20,400 --> 00:12:26,800 posited is that, for a decentralized network of knowledge to exist, you needed to have at the ends 126 00:12:27,440 --> 00:12:32,880 computers that could be used by the average user. So the personal computer and the concept of 127 00:12:32,880 --> 00:12:38,320 the internet were actually birthed out of the same framework. That's not actually commonly understood, 128 00:12:38,320 --> 00:12:44,400 but it makes sense on many levels. As you look at the history of the web as it took off in the 129 00:12:44,400 --> 00:12:50,240 early 90s, it's no accident that the first HTTP server was actually a personal computer. And, funny 130 00:12:50,240 --> 00:12:56,080 enough, this is Tim Berners-Lee's machine over at CERN. You can see there's a sticker to 131 00:12:56,080 --> 00:13:00,720 clarify that this is not just a personal computer, but it's actually a server. And he explained to 
And he explained to   132 00:13:00,720 --> 00:13:05,840 people that they shouldn't turn off this machine  because it was basically hosting information .  133 00:13:06,800 --> 00:13:12,160 From those early promising visions  around decentralized technologies and and   134 00:13:12,160 --> 00:13:18,160 how one might want to build a global internet  infrastructure, there's a growing realization that   135 00:13:18,160 --> 00:13:23,840 now as the internet has taken off and working at  scale with so many parts of our lives touching it,   136 00:13:23,840 --> 00:13:28,720 but something really is a myth, that as the  internet has become more centralized and dominated   137 00:13:28,720 --> 00:13:35,200 by corporations that have often monopolistic  practices, that there's been tremendous issues   138 00:13:35,200 --> 00:13:40,160 with establishing trust in this type of system and  that we might want to return to decentralization   139 00:13:40,880 --> 00:13:45,360 to restore trust and to restore a sense of  fairness within the internet. So that's where a lot   140 00:13:45,360 --> 00:13:50,800 of these ideas are coming from, to be clear, this  is actually how the original original internet   141 00:13:50,800 --> 00:13:56,320 was designed and meant to be. The decentralized  web summit in 2018, which is how I got really   142 00:13:56,320 --> 00:14:02,560 into the fold, was a program that was hosted by  Wendy and Brewster over at the Internet Archive   143 00:14:02,560 --> 00:14:06,480 and it was really transformative in bringing  a number of different individuals together to   144 00:14:06,480 --> 00:14:12,960 think about these issues. 
And it's important to mention that the importance of this 145 00:14:12,960 --> 00:14:18,000 event was that it was a cultural event as well as a technical event, in which people were 146 00:14:18,000 --> 00:14:22,720 thinking about shared values that were distinctly different from things you may have heard of 147 00:14:22,720 --> 00:14:28,480 around blockchains or cryptocurrencies. This was a group of people that were concerned with 148 00:14:28,480 --> 00:14:32,800 how you guarantee access to knowledge and how you can use decentralized technologies 149 00:14:32,800 --> 00:14:37,760 for things like preservation, which is our topic today. So the Starling Lab came out 150 00:14:37,760 --> 00:14:43,040 of that rich tradition, and in working with the USC Shoah Foundation and Stanford's 151 00:14:43,040 --> 00:14:47,360 Department of Electrical Engineering, we've found this incredible opportunity to bring 152 00:14:47,360 --> 00:14:52,560 together experts to think about how we can deploy decentralized technologies to advance human rights. 153 00:14:53,520 --> 00:14:59,440 We work with a number of different industry partners. And the founding 154 00:15:00,240 --> 00:15:04,320 work and research that we did was with the USC Shoah Foundation's Visual History Archive. 155 00:15:04,960 --> 00:15:10,160 As Wendy mentioned, this is an archive that deals with the testimony of the survivors of genocide. 156 00:15:10,160 --> 00:15:16,560 It started 30 years ago by cataloging the stories of survivors of the Holocaust. It expanded, 157 00:15:16,560 --> 00:15:21,680 and now they're working on, I believe, their 14th genocide collection as of last week. 158 00:15:23,120 --> 00:15:28,640 And sadly, of course, that number continues to increase. 
There are over 55,000 survivors' 159 00:15:28,640 --> 00:15:34,480 testimonies. On average, it's about two and a half hours and several gigabytes for every 160 00:15:35,280 --> 00:15:38,960 testimony. So it's a massive four petabyte collection, and currently 161 00:15:38,960 --> 00:15:44,160 it sits in three different data centers that are all state-of-the-art tape-based archival systems. 162 00:15:44,720 --> 00:15:49,600 But working with them, we've really reimagined a continuum of preservation that goes 163 00:15:49,600 --> 00:15:56,720 beyond just their data centers that are maintained by the USC Shoah Foundation. Bravely, their CTO 164 00:15:56,720 --> 00:16:02,480 Sam Gustman has been working with us on figuring out a way to take the entire four petabytes of 165 00:16:02,480 --> 00:16:08,640 the Shoah Foundation's archive and put it onto the distributed web. In addition, I 166 00:16:08,640 --> 00:16:12,560 should mention that he's also looking at longer term media storage, like storage on silica and 167 00:16:12,560 --> 00:16:18,240 DNA et cetera, so they're quite innovative and progressive in thinking about preservation. The 168 00:16:18,240 --> 00:16:23,360 type of content that we've been working on is in looking at genocide testimonies. We've expanded 169 00:16:24,000 --> 00:16:29,040 our testimony collections with them to understand how the whole life cycle of testimony is collected 170 00:16:29,040 --> 00:16:37,440 and preserved and indexed. We've gone to Iraq, Los Angeles, the Amazon rainforest. 171 00:16:37,440 --> 00:16:42,880 We've been working in Syria on preserving testimony that could be potentially useful not only for 172 00:16:42,880 --> 00:16:48,640 humanitarian causes, but also for accountability as well. We look at preservation for those purposes. 
173 00:16:49,360 --> 00:16:53,040 And finally, we've been working with news organizations like Reuters to look at 174 00:16:53,040 --> 00:16:58,640 their archives. And most recently we finished a project with them last year called the 78 Days, 175 00:16:58,640 --> 00:17:04,800 which cataloged each of the 78 days between the election and the inauguration, which included 176 00:17:04,800 --> 00:17:11,280 January 6th as well. What we found is that what we're creating as a set of solutions is not 177 00:17:11,280 --> 00:17:17,040 just centered around preservation but it's also around restoring trust. And so what I want to show 178 00:17:17,040 --> 00:17:22,880 you today is how we think about how that might work, end to end. And it begins by following 179 00:17:22,880 --> 00:17:28,720 the natural life cycle of how you go and generate data, which would start with capturing. And then 180 00:17:28,720 --> 00:17:35,040 you move to storage, and then finally verification. And each of those steps is critical to 181 00:17:35,760 --> 00:17:40,320 ensuring that you have a data set that could be trusted. And what I want to show you today 182 00:17:40,320 --> 00:17:45,840 is how decentralized systems can actually guarantee trust at each of these three stages. 183 00:17:46,560 --> 00:17:52,000 Let's begin with capturing on something like, let's say, a mobile phone. In our case, what we've done is 184 00:17:52,000 --> 00:17:57,280 we've taken not only the phone's ability to take a photo, but also looked at all the other sensor 185 00:17:57,280 --> 00:18:03,440 information that exists on the phone, like GPS for location, network information to establish 186 00:18:03,440 --> 00:18:08,640 a relative location, the gyroscope to understand the relative position of the camera, and even 187 00:18:08,640 --> 00:18:12,160 things like time and date. 
These are all critical pieces of metadata that the 188 00:18:12,720 --> 00:18:18,400 phone is able to generate. What we do is we take that metadata and we pair it with the image so 189 00:18:18,400 --> 00:18:23,280 that every time that you take a photo, you now have the payload not only of the image pixels 190 00:18:23,280 --> 00:18:31,040 but also of this metadata. Now, through our process of working with HTC, we've been able to take this 191 00:18:31,040 --> 00:18:35,120 payload and do something really special with it on the device, which is that we first of all 192 00:18:35,120 --> 00:18:40,480 create a hash of it on the device itself. And then we sign that hash with a cryptographic key that is 193 00:18:40,480 --> 00:18:46,320 guaranteed by firmware which is on specialized hardware inside the phones. So what that means, 194 00:18:46,320 --> 00:18:53,200 really simply, is that now you have a unique fingerprint of both the image and the metadata. 195 00:18:53,200 --> 00:18:59,520 And we've signed that so that we know that that fingerprint is secure. So with that payload of 196 00:18:59,520 --> 00:19:04,000 preservation information and all the metadata, we now take it and we put it on the decentralized 197 00:19:04,000 --> 00:19:12,880 web. So that step begins by first creating a CID, which is a unique identifier for that payload. 198 00:19:12,880 --> 00:19:17,840 And then we spread it out across the decentralized web, basically splitting it up into different pieces. 199 00:19:18,560 --> 00:19:24,400 And there we can store it onto different types of nodes. So you can imagine academic institutions, 200 00:19:24,400 --> 00:19:32,400 non-profits, enterprise cloud, even small devices like the Raspberry Pi, or a personal 201 00:19:32,400 --> 00:19:38,400 computer. Even a phone. All of these different nodes are, in our minds, appropriate for storage because 202 00:19:38,400 --> 00:19:43,440 we want to diversify storage. 
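[Editor's illustration] The capture step described above (pair the image with sensor metadata, hash the payload on the device, then sign the hash) might look roughly like the following Python sketch. Several things here are assumptions for illustration only: the metadata field names are invented, and an HMAC with a shared key stands in for the hardware-backed signing key the speaker describes, which in a real device would be an asymmetric key kept in secure hardware. This is not Starling Lab's or HTC's actual implementation.

```python
import hashlib
import hmac
import json

# Hypothetical stand-in for a key held in the phone's secure hardware.
DEVICE_KEY = b"hypothetical-device-key"


def build_payload(image_bytes: bytes, metadata: dict) -> dict:
    """Pair a digest of the image pixels with the sensor metadata."""
    return {
        "image_sha256": hashlib.sha256(image_bytes).hexdigest(),
        "metadata": metadata,
    }


def fingerprint(payload: dict) -> str:
    """Hash a canonical JSON encoding so the same payload always hashes the same."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def sign(fp: str, key: bytes = DEVICE_KEY) -> str:
    """Sign the fingerprint (HMAC-SHA256 used here as a simple stand-in)."""
    return hmac.new(key, fp.encode("ascii"), hashlib.sha256).hexdigest()


def verify(payload: dict, signature: str, key: bytes = DEVICE_KEY) -> bool:
    """Recompute the fingerprint and check it against the signature."""
    return hmac.compare_digest(sign(fingerprint(payload), key), signature)


# Invented example values; a real capture would read these from the sensors.
payload = build_payload(
    b"\xff\xd8 fake image bytes",
    {"gps": [40.7411, -73.9897], "timestamp": "2020-11-03T14:05:00Z"},
)
signature = sign(fingerprint(payload))

# Copy the payload and alter one metadata field to simulate tampering.
tampered = json.loads(json.dumps(payload))
tampered["metadata"]["gps"] = [0.0, 0.0]

print(verify(payload, signature))   # the untouched payload checks out
print(verify(tampered, signature))  # the altered payload does not
```

The point of signing the fingerprint rather than the raw photo is that a single short value then vouches for both the pixels and the metadata together.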
And that's really critical to our framework. In addition to that, 203 00:19:43,440 --> 00:19:48,880 we use cryptography and advanced proof of spacetime, like the kind that Filecoin has, for example, 204 00:19:48,880 --> 00:19:54,080 to ensure that as you spread information far and wide, you're also ensuring that its integrity is 205 00:19:54,080 --> 00:19:59,520 kept. And then if any of those nodes that take the data manipulates the data, we now have a way 206 00:19:59,520 --> 00:20:05,440 of proving that in fact that manipulation has occurred. Paradoxically, what this means is that 207 00:20:05,440 --> 00:20:10,640 as you spread information farther and wider, not only are you able to preserve the information 208 00:20:11,200 --> 00:20:14,240 better, but you're actually able to create a seal around the information. 209 00:20:14,880 --> 00:20:19,840 And the more nodes that join that network, the harder it is to break that seal. 210 00:20:20,800 --> 00:20:25,760 So that's our preservation story. But as you all know from working in the archival space, 211 00:20:25,760 --> 00:20:29,440 it doesn't end there. Just because you have a record of something that you can prove has 212 00:20:29,440 --> 00:20:34,720 not been manipulated, the contents of the objects still matter. They need to be examined 213 00:20:34,720 --> 00:20:40,240 and they need to be indexed. And so the expert certification of the content of information 214 00:20:40,240 --> 00:20:44,480 is something that is normally done through an archival process of indexing and verification. So 215 00:20:44,480 --> 00:20:50,640 we take those applications, and those too we put on decentralized systems, so that those 216 00:20:50,640 --> 00:20:56,480 records of the authenticity and the verification of the content itself can also be preserved 217 00:20:56,480 --> 00:21:03,360 on a decentralized system. 
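[Editor's illustration] The claim above, that a node which manipulates its copy can be caught, can be illustrated with a very simple audit in Python. This is only a sketch of the underlying idea (compare each replica's hash with the recorded fingerprint); it is not Filecoin's proof of spacetime, which is a far more involved cryptographic protocol. The node names and the `audit` helper are hypothetical.

```python
import hashlib
from typing import Dict


def digest(data: bytes) -> str:
    """SHA-256 fingerprint of a replica's bytes."""
    return hashlib.sha256(data).hexdigest()


def audit(expected: str, replicas: Dict[str, bytes]) -> Dict[str, bool]:
    """Report, per node, whether its copy still matches the recorded fingerprint."""
    return {node: digest(data) == expected for node, data in replicas.items()}


testimony = b"testimony segment 0001"
recorded = digest(testimony)  # fingerprint recorded when the item was ingested

# Three hypothetical nodes; the last one holds an altered copy.
replicas = {
    "university": testimony,
    "nonprofit": testimony,
    "raspberry-pi": b"testimony segment 0001 (altered)",
}

results = audit(recorded, replicas)
print(results)
```

With many independent nodes holding copies, a single altered replica stands out against the recorded fingerprint and the intact copies, which is one way to read the "seal" the speaker describes.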
So that becomes basically the three parts of our system: capture, 218 00:21:03,360 --> 00:21:09,280 store, and verify. As a last step, I want to show you where all this stuff is stored. 219 00:21:09,840 --> 00:21:14,960 In working with Adobe and Microsoft and the Linux Foundation, we've been helping pioneer 220 00:21:14,960 --> 00:21:19,600 a set of standards called the C2PA, which allow you to take all this information and actually 221 00:21:19,600 --> 00:21:27,120 put it directly, as an example, with a JPEG, inside the photograph itself, so that now the photograph 222 00:21:27,120 --> 00:21:32,160 becomes a universe of information not only of image pixel data but also of these cryptographic 223 00:21:32,160 --> 00:21:38,880 proofs and also of these verifications. So now if, for instance, I, in this example, can use a small app, 224 00:21:38,880 --> 00:21:44,240 I can click on this eye and I can see all of the information around the photograph and its metadata, 225 00:21:44,240 --> 00:21:49,280 and I can also see the links back to where this information sits on the decentralized web. 226 00:21:49,280 --> 00:21:53,280 All of this is just contained inside of the JPEG. So we think that's pretty nifty, because it 227 00:21:53,280 --> 00:21:59,120 now changes every photograph from being just a photo -- a container of image pixels -- to now being 228 00:21:59,120 --> 00:22:03,840 a universe of information for fact checking, for image verification, et cetera, et cetera. 229 00:22:05,040 --> 00:22:09,680 Alright, I'm going to close very quickly by just going through our prototype and some of our 230 00:22:09,680 --> 00:22:14,560 learnings. So I described to you a little bit about the work that we did with Reuters, but it really 231 00:22:14,560 --> 00:22:20,640 was an unfolding set of experiments during the course of the 2020 election. 
And what we did is, we 232 00:22:20,640 --> 00:22:26,000 used our technology to go out with photojournalists at Reuters. And I'll show you, end to end, how we 233 00:22:26,000 --> 00:22:31,520 establish this new form of digital trust. It began by having photos from this professional-grade 234 00:22:31,520 --> 00:22:36,240 camera move over to the phone, where they were notarized through the process I've described. And 235 00:22:36,240 --> 00:22:40,400 then they end up in the CMS system at the Reuters headquarters at their photo desk in London. 236 00:22:40,960 --> 00:22:45,840 And then from there we took that information, which included, again, not only the photo but also things 237 00:22:45,840 --> 00:22:53,120 like location and a hash of the image, all that complex metadata, and we were able to syndicate it 238 00:22:53,120 --> 00:23:00,160 out to different decentralized systems. In this case, the first step was to syndicate it out to a 239 00:23:00,160 --> 00:23:05,040 private permissioned system with IBM. This is a form of blockchain technology that's 240 00:23:05,040 --> 00:23:10,000 called a private permissioned ledger. So that's the first step. And then the second step was, we 241 00:23:10,000 --> 00:23:13,760 put it on a public permissionless ledger, which is similar to something you can think of almost 242 00:23:13,760 --> 00:23:19,600 like Bitcoin, where we were able to store a hash of that information also out on the public web. 243 00:23:19,600 --> 00:23:23,920 So this allowed you to preserve privacy and also have a public system of verification. 244 00:23:25,120 --> 00:23:28,720 I think we could have never imagined what we were actually going to capture during those 78 245 00:23:28,720 --> 00:23:33,200 days. I think they caught all of us by surprise in terms of how historic they proved to be. 
246 00:23:34,000 --> 00:23:40,640 But what I can say is that the efforts of our technology development were certainly on -- 247 00:23:40,640 --> 00:23:44,800 they weighed very heavily with us as we thought about what we were doing in helping think about 248 00:23:45,680 --> 00:23:48,880 the restoration of trust, because surely I think we can all agree, no matter what side 249 00:23:48,880 --> 00:23:53,520 of the aisle we're on, that the demonization of the free press is something that we should all 250 00:23:53,520 --> 00:23:59,760 strive to end. And, hopefully, the work that we're doing in creating an archive that can withstand 251 00:24:00,480 --> 00:24:06,400 the challenges of misinformation and the challenges of manipulation through social media 252 00:24:06,400 --> 00:24:11,920 is a really good step in that direction. And so you can check out the website yourself: 253 00:24:11,920 --> 00:24:17,840 starlinglab.org/78days. It'll give you a chance to play around with the archive and also see more in-depth 254 00:24:17,840 --> 00:24:23,440 explanations of the technology. I'll wrap up here in the last minute with our learnings. 255 00:24:25,120 --> 00:24:30,480 You matter. It's probably the biggest thing I can mention, which is that institutions like libraries and 256 00:24:30,480 --> 00:24:36,480 archives are a key part of creating a solution that is networked, and that, as a community, 257 00:24:36,480 --> 00:24:40,960 if we can all come together to guarantee the integrity of information, we're in a 258 00:24:40,960 --> 00:24:45,280 unique position to create a new foundation of digital trust. So it takes that form of 259 00:24:45,280 --> 00:24:50,400 collaboration, and when we think about decentralization, it's not a single destination 260 00:24:50,400 --> 00:24:55,520 but it's an unfolding process in which we continually strive to bring more and more diverse 261 00:24:55,520 --> 00:25:01,040 nodes into our system. 
And the more diverse those  nodes are, the more that they're going to be able   262 00:25:01,040 --> 00:25:06,800 to store and verify information. And so that's why you might think of multiple ledgers   263 00:25:06,800 --> 00:25:12,240 and multiple decentralized systems coming into  play, because they can allow for a tremendous   264 00:25:12,240 --> 00:25:17,920 amount of diversification of cryptographic  features, of performance, of methods of preservation.   265 00:25:17,920 --> 00:25:25,440 And last, of course, diverse use. Think of  decentralization a lot like biodiversity: this   266 00:25:25,440 --> 00:25:30,720 is how we get resilience, both  at a technical level and also at a community level.   267 00:25:31,440 --> 00:25:34,400 With that I'll pass it back to  Wendy. Thanks so much for having me.  268 00:25:35,892 --> 00:25:41,412 Thank you, Jonathan. We have some questions, some  really good questions. So one question is how   269 00:25:41,412 --> 00:25:45,652 does this differ actually from BitTorrent, which is  a very good question. 270 00:25:46,416 --> 00:25:52,080 There's a lot of similarities actually. BitTorrent works by syndicating  information across multiple different nodes.   271 00:25:52,080 --> 00:25:58,480 One of the big differences in our work  is that we choose nodes. So whereas BitTorrent   272 00:25:58,480 --> 00:26:03,600 is meant to be diffuse and random with how  information is spread across, and it's optimized   273 00:26:03,600 --> 00:26:09,280 basically at the protocol level, we think about the  decentralization process as something where we want   274 00:26:09,280 --> 00:26:15,440 archives to have a role in choosing to which  nodes they distribute their information. And so   275 00:26:15,440 --> 00:26:18,522 that's a major distinction. 276 00:26:18,522 --> 00:26:22,104 Is your tech open source and can you point us to a GitHub? 277 00:26:23,880 --> 00:26:30,000 Yes. 
To be clear, we've implemented open  source technology. And our prototypes are --   278 00:26:30,000 --> 00:26:35,600 we're in the process of putting out various parts  of our code base. But really we haven't created   279 00:26:35,600 --> 00:26:40,960 any novel technology. We've just created novel  implementation. So I'll be very happy to refer   280 00:26:40,960 --> 00:26:44,720 you over to our website, and if you want  to reach out, I can give you a list of the   281 00:26:44,720 --> 00:26:48,880 different protocols that we've used and all  of those are open source and we are very   282 00:26:48,880 --> 00:26:54,160 firmly committed to being a part of an open source  ecosystem both as contributors and also publishers   283 00:26:55,440 --> 00:27:02,080 So jonathan what's the name of that jpeg embedded  metadata standard. Librarians are very keen on   284 00:27:03,360 --> 00:27:06,142 helping to create better metadata. 285 00:27:06,142 --> 00:27:13,120 Sure, so the link is actually there, I see how there's put it in, which is  the C2PA. There's a very   286 00:27:13,120 --> 00:27:17,600 welcoming and open environment there for people  to weigh in. I think archivists are a key part of   287 00:27:18,160 --> 00:27:24,480 helping us come up with a standard that's going  to be useful for them. So we'd be really happy for   288 00:27:24,480 --> 00:27:29,920 people to contribute to that standard. And it's  based out of the Linux Foundation so it too has   289 00:27:30,480 --> 00:27:33,674 open source commitments. 290 00:27:33,674 --> 00:27:36,148 So someone said, are  you licensing software 291 00:27:37,676 --> 00:27:38,857 As an organization, no. 292 00:27:39,120 --> 00:27:45,600 We're a lab that's experimenting to help create  a sense of the art of the possible and we have   293 00:27:45,600 --> 00:27:50,880 various partners that we work with. Almost all  of them are fully transparent and open source   294 00:27:50,880 --> 00:27:55,200 in their work. 
That's a key criterion in  working with them. And in that way there's   295 00:27:55,200 --> 00:28:00,240 really no complexities with the licensing here.  You can use that, you can support it, et cetera.   296 00:28:00,240 --> 00:28:06,240 Nicholas Taylor mentions that I had brought up  incentives and contracts as mechanisms for ensuring   297 00:28:06,240 --> 00:28:12,880 persistence. Can you elaborate on how persistence  is assured or supported in the Starling Framework?   298 00:28:14,052 --> 00:28:18,720 Sure. Remember, we're a framework that allows  you to help make better choices. And we use a   299 00:28:18,720 --> 00:28:23,840 variety of different protocols. Each of  those protocols -- we don't endorse them as   300 00:28:23,840 --> 00:28:27,920 best practices, but we're experimenting with them  to understand how they could achieve persistence.   301 00:28:29,760 --> 00:28:34,320 I'd say that if you look at currently what's  out there, I would caution people that   302 00:28:34,320 --> 00:28:40,480 there are some big promises that are being  made about immutability and persistence and   303 00:28:40,480 --> 00:28:46,800 permanence. We as a lab try to avoid those words  because we're concerned about that with any of these   304 00:28:46,800 --> 00:28:52,880 technologies and communities. History shows that  really nothing can be guaranteed to be permanent.   305 00:28:53,440 --> 00:28:58,480 And so it really takes active efforts to ensure  that type of thing. Now, what's new is that you   306 00:28:58,480 --> 00:29:02,560 really have these incentive layers that could  potentially allow people to think about the   307 00:29:02,560 --> 00:29:06,720 creation of endowments, for instance, that could  persist for years and years if they're properly   308 00:29:06,720 --> 00:29:13,600 architected and if the economics bear out. So  in all the cases, whether it's Filecoin or Arweave   309 00:29:14,240 --> 00:29:18,240 and people from Storj are here as well... 
they  can talk to you about how you can use some of   310 00:29:18,240 --> 00:29:23,120 those incentives to help ensure that people that  are hosting information are incentivized to do   311 00:29:23,120 --> 00:29:28,880 that long term. But the reality is that that's  never a passive effort. The data owners and the   312 00:29:28,880 --> 00:29:34,560 archivists like you have to be involved in helping  architect some of those best practices, and you   313 00:29:34,560 --> 00:29:39,120 shouldn't gloss over the details because it's  really important that everyone understand   314 00:29:39,120 --> 00:29:42,448 what the incentive mechanisms and the security  mechanisms are there. 315 00:29:42,618 --> 00:29:50,503 We have some very knowledgeable questioners here. Kiernan says, as someone  more familiar with LTO storage and trusting the   316 00:29:50,503 --> 00:29:56,740 hashes and bag manifests: is the idea here that  these are not trustworthy enough in certain contexts?  317 00:29:58,904 --> 00:30:03,040 To be clear, I'm not as familiar with  LTO storage, so you can help enlighten me, Kiernan.   318 00:30:03,040 --> 00:30:07,520 But what we found is that typically  most archiving organizations will just   319 00:30:07,520 --> 00:30:14,720 have hashes. They'll just store hashes like SHA-256  of their underlying data, and that is not   320 00:30:14,720 --> 00:30:19,440 enough, because unless you sign that information  you really don't have a way of protecting those   321 00:30:19,440 --> 00:30:25,760 hashes and ensuring that they have integrity. So  we're providing not only hashing and signing, but   322 00:30:25,760 --> 00:30:30,960 then also a way of putting that information on  a decentralized ledger. So think about it as like   323 00:30:30,960 --> 00:30:35,760 the belt and suspenders. 
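A minimal Python sketch of the hash-then-sign idea described in this answer. It is an illustration under assumptions, not the Starling implementation: HMAC-SHA256 from the standard library stands in for a real public-key signature, and `SECRET_KEY`, `hash_content`, `sign_hash`, and `verify_record` are hypothetical names. Publishing the record to a ledger is not shown.

```python
import hashlib
import hmac

# Hypothetical signing key. A real system would use a private key held by the archive.
SECRET_KEY = b"archive-signing-key"

def hash_content(data: bytes) -> str:
    """Return the SHA-256 hex digest of the content."""
    return hashlib.sha256(data).hexdigest()

def sign_hash(digest: str, key: bytes = SECRET_KEY) -> str:
    """Sign the digest. HMAC is only a stand-in for a public-key signature."""
    return hmac.new(key, digest.encode(), hashlib.sha256).hexdigest()

def verify_record(data: bytes, digest: str, signature: str,
                  key: bytes = SECRET_KEY) -> bool:
    """Check that the data matches the hash and that the hash matches the signature."""
    if hash_content(data) != digest:
        return False
    return hmac.compare_digest(sign_hash(digest, key), signature)

photo = b"example image bytes"
digest = hash_content(photo)
signature = sign_hash(digest)
```

With a bare hash, someone who can alter the file can also recompute and replace the stored hash. The signature check is what catches that swap in this sketch.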
In this case we're not  taking anything for granted about the integrity   324 00:30:35,760 --> 00:30:40,320 of the hash; instead we are finding multiple  layers of trust that we can put on top of the hash   325 00:30:40,320 --> 00:30:45,280 so that we all ensure that when we look back, let's  say, in 50 years, that we know that that hash was   326 00:30:45,280 --> 00:30:48,843 actually properly created and it was secured  over time 327 00:30:50,000 --> 00:30:57,040 I don't know if you've ever thought of this, but you're speaking to a lot of people  from memory institutions like libraries, museums   328 00:30:57,760 --> 00:31:03,840 Looking in the future where do you see  decentralized storage applied in in their world?   329 00:31:05,200 --> 00:31:09,760 I like to think about it ,when I talk to  the folks at the Shoah Foundation who are on the   330 00:31:09,760 --> 00:31:16,640 archiving side, I like to put their mind at ease  and say I think this is a backup to the backup.   331 00:31:16,640 --> 00:31:21,280 What I mean by that is a starting point  is that this is really cold storage and   332 00:31:21,280 --> 00:31:25,440 it's diffuse. So that means it's going to take  time to reconstitute these types of archives.   333 00:31:25,440 --> 00:31:30,400 And if we need to have a restore event -- and  that's okay because actually that's a that's a   334 00:31:30,400 --> 00:31:36,880 great form of resilience -- is to think about how you  can diversify organizations and geography.    335 00:31:36,880 --> 00:31:42,960 If that takes a little bit longer to get this back  up of a backup back in your hands, I'd argue to   336 00:31:42,960 --> 00:31:47,840 you that that's still really valuable. 
Having  been part of many technology organizations over   337 00:31:47,840 --> 00:31:52,000 the last 20 years I can't tell you how many times  we've been in a situation where we've trusted our   338 00:31:52,000 --> 00:31:57,840 vendor and trusted all the preparations we've made  and in the end the server that was still standing   339 00:31:57,840 --> 00:32:02,640 was the one that was offline, in the middle  of nowhere, that someone forgot even existed.   340 00:32:02,640 --> 00:32:06,960 Those are the types of things that can  be essentially that type of serendipity, is   341 00:32:06,960 --> 00:32:11,600 something you don't want to bank on. Instead  you want to actually think a little bit ahead   342 00:32:12,160 --> 00:32:16,880 and these types of systems right now in their  current state really can function in that way. 343 00:32:16,880 --> 00:32:28,240 I would say they're outside of your traditional,  your performant forms of storage but  instead are a new way to think about preservation   344 00:32:28,240 --> 00:32:33,520 and as these technologies get more mature then we  can start to move them up in in our priority and reliability 345 00:32:35,133 --> 00:32:40,600 Thanks so much for joining us and for  the great work you're doing with so many different organizations.  346 00:32:40,620 --> 00:32:43,760 Likewise, Wendy. We're always inspired  by you as well. Cheers. Thanks for having me. 347 00:32:44,560 --> 00:32:50,320 Okay, well, let's go on to see some demos. What Jonathan was talking about was cold storage   348 00:32:50,320 --> 00:32:57,120 but what if you wanted active storage at scale.  We're going to be showing you two projects that   349 00:32:57,120 --> 00:33:03,120 try to experiment with that. First i'd like to  introduce to you our Arkadiy Kukarkin. 
He is one   350 00:33:03,120 --> 00:33:08,720 of the top DWeb engineers working today and  we are so honored and pleased that he works   351 00:33:08,720 --> 00:33:14,080 with us at the Internet Archive. He was the  founding CTO of organization called Media   352 00:33:14,080 --> 00:33:20,480 Chain which used blockchains to authenticate  the provenance of music. And he also worked for   353 00:33:20,480 --> 00:33:27,840 Protocol Labs which is the parent company of Filecoin. Now we gave Arkadiy this experiment to work on.   354 00:33:28,560 --> 00:33:34,560 Could you take a different type of data file, in  this case WARCs, or web archive files, and could you   355 00:33:34,560 --> 00:33:41,200 store them at scale across the Filecoin network?  And we chose this collection, the end-of-term   356 00:33:41,200 --> 00:33:47,280 archive from 2016. Now that was at the end of the  Obama Administration, the beginning of the Trump   357 00:33:47,280 --> 00:33:54,720 administration, and it gathered together the entire  federal presence, every dot gov and dot mil website   358 00:33:54,720 --> 00:34:01,200 at that time. It was a collaborative collection. The  Library of Congress, Stanford, California Digital   359 00:34:01,200 --> 00:34:06,720 Library, and many institutions worked together  with the Internet Archive to pull this together .  360 00:34:06,720 --> 00:34:11,680 It's about 200 terabytes large. Now, if you  were going to replicate it three times, that's   361 00:34:11,680 --> 00:34:19,680 600 terabytes you need. It's about 20,000 items,  a million files, billions of individual URLs so   362 00:34:19,680 --> 00:34:23,726 Arkadiy can you show us how you've been doing? 363 00:34:23,726 --> 00:34:36,080 Hello. My name is Arkadiy Kukarkin and I'm going to show you how our experiment here is going so far. Let's just get started.   364 00:34:36,080 --> 00:34:43,200 We use two technologies here primarily: IPFS and  Filecoin. 
IPFS you can think of as a way to   365 00:34:43,200 --> 00:34:47,280 locate and retrieve content through a  peer-to-peer network, and Filecoin you   366 00:34:47,280 --> 00:34:53,360 can think of as a way to ensure, or at least  attempt to ensure, the long-term preservation   367 00:34:53,360 --> 00:35:01,760 of that content. Probably the best way to  dive in is to just look at a simple example   368 00:35:01,760 --> 00:35:08,400 I have here IPFS enabled in my browser. I'm using Brave, but you can also install   369 00:35:08,400 --> 00:35:14,160 an extension to do this in any other  browser as well. We can take a look   370 00:35:14,160 --> 00:35:21,040 at my node here. Here's some stats. But the  most interesting thing is probably the peer list   371 00:35:21,040 --> 00:35:28,560 which may take a second to populate, but you can  see I'm connected to almost 1,400 peers throughout   372 00:35:28,560 --> 00:35:36,880 the world and as they're coming up now. We  actually see some in Russia and Ukraine as well   373 00:35:37,440 --> 00:35:41,840 which is an interesting demonstration of the  resiliency of these peer-to-peer connections   374 00:35:41,840 --> 00:35:50,080 because as you know web traffic to those  places is currently disrupted. So let's take a look   375 00:35:50,080 --> 00:35:56,560 at just a simple image file here on the  METRO website. We can import it into   376 00:35:56,560 --> 00:36:06,160 IPFS just like you could any normal file. Okay,  here it is. Let's take a look.   377 00:36:07,840 --> 00:36:14,960 Bam. Here's our our image and you can  see sort of funny-looking URL here at the top.   378 00:36:14,960 --> 00:36:21,600 Hopefully you can read that. Instead of  HTTP we have IPFS and then we have this   379 00:36:21,600 --> 00:36:28,560 sort of scary-looking long identifier. What  happened here is that the file was loaded into   380 00:36:28,560 --> 00:36:37,920 my local node and hashed and made available  to the entire IPFS network. 
If anyone   381 00:36:37,920 --> 00:36:43,760 pretty much anywhere in the world were to enter this IPFS URL, they would be able to access this   382 00:36:43,760 --> 00:36:48,720 file, maybe from my machine, maybe from another  machine that also happens to have the same one,   383 00:36:48,720 --> 00:36:55,200 maybe from an intermediate node somewhere in that  network of 1,400 machines that I showed you.   384 00:36:55,760 --> 00:37:04,000 I think this is already cool because you're  able to access a file simply by its identifier,   385 00:37:04,000 --> 00:37:10,240 the CID that Jonathan mentioned already, without  knowing or really caring where it came from.   386 00:37:10,240 --> 00:37:19,760 The reason that works is that the CID is  actually -- well, it's a little bit truncated   387 00:37:19,760 --> 00:37:27,520 here -- but the CID, this long string, is in  fact an encoding of a content hash, which, again,   388 00:37:27,520 --> 00:37:32,560 was mentioned by Jonathan. We're not  applying as rigid of a standard here, so   389 00:37:32,560 --> 00:37:38,240 it's not a signed hash, but nonetheless if you  request this particular identifier you are   390 00:37:38,240 --> 00:37:44,960 pretty much guaranteed to get the exact same  file back. I think that's already pretty   391 00:37:44,960 --> 00:37:53,600 cool because if we think about something like  the lifetime of hyperlinks in a research paper,   392 00:37:53,600 --> 00:38:00,080 so this is just the graphic I pulled down. After just a few years, something close to 50%   393 00:38:00,800 --> 00:38:07,200 of all hyperlinks across academic papers are no  longer resolvable. Maybe they exist elsewhere;   394 00:38:07,200 --> 00:38:12,000 let's say the Internet Archive has archived a  copy on the Wayback Machine, or someone else has a   395 00:38:12,000 --> 00:38:19,440 copy. But the actual link is broken and needs  to be manually fixed or followed   396 00:38:19,440 --> 00:38:27,120 for trust to be ensured. 
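The point that a CID is an encoding of a content hash can be sketched in Python. This is a hedged illustration, not how the demo was produced: it builds a CIDv1 for a single raw block hashed with SHA-256, and the helper names `raw_cid_v1` and `matches_cid` are hypothetical. Files imported through an IPFS node are normally chunked and wrapped in a file structure first, so the CID of a real import will generally differ from this raw-block value.

```python
import base64
import hashlib

def raw_cid_v1(data: bytes) -> str:
    """Build a CIDv1 string for a single raw block hashed with SHA-256."""
    digest = hashlib.sha256(data).digest()
    # 0x01 = CID version 1, 0x55 = "raw" codec,
    # 0x12 = sha2-256 multihash code, 0x20 = digest length (32 bytes).
    cid_bytes = bytes([0x01, 0x55, 0x12, 0x20]) + digest
    # Multibase prefix "b" means lowercase base32 without padding.
    encoded = base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")
    return "b" + encoded

def matches_cid(data: bytes, cid: str) -> bool:
    """Anyone holding the bytes can recompute the identifier and compare."""
    return raw_cid_v1(data) == cid

cid = raw_cid_v1(b"hello from the archive")
```

Because the identifier is derived from the bytes themselves, a copy fetched from any peer can be checked locally, which is why it does not matter which machine served it.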
Imagine the same paper  using these references instead of a traditional   397 00:38:27,120 --> 00:38:34,480 URL. It will just work as long as another copy  is available in the network. Let's move on to   398 00:38:35,040 --> 00:38:41,200 a real example. The data set that Wendy  mentioned is the End of Term web archive.   399 00:38:41,200 --> 00:38:50,080 We're using the 2016 version which I think is  probably a relatively hot set, as it were, and   400 00:38:50,080 --> 00:39:00,080 here's a copy that's just  available on the web. You can load a   401 00:39:00,080 --> 00:39:06,720 page here. It's a little bit slow, but here we are.  Here's the Indianapolis FBI Bureau in fall   402 00:39:06,720 --> 00:39:17,280 of 2016. Here's what the backing  data looks like. This is just a whole lot of   403 00:39:17,280 --> 00:39:27,840 basically gigabyte-sized WARC web archives,  and so just as before we have the CID identifier 404 00:39:28,880 --> 00:39:37,200 and we can pull it up, and in fact we can  actually load it into some tools that have   405 00:39:37,200 --> 00:39:43,200 already added IPFS loading support. Here's  ReplayWeb.page, which is actually just a static   406 00:39:43,200 --> 00:39:51,040 file that loads from IPFS itself as well  and lets you browse the collection, so that's   407 00:39:51,040 --> 00:39:59,600 already pretty cool. If you're a researcher or  an archivist, you may already de facto have a copy   408 00:39:59,600 --> 00:40:05,280 of this, having accessed it, so we have lots of copies,  they're keeping stuff safe, but is it safe enough?   409 00:40:05,920 --> 00:40:12,240 I think in this case it's actually probably not  the case because this is important data but it's   410 00:40:12,240 --> 00:40:19,680 a very large amount of data, and it's data that  will probably sit around, not looked at for the   411 00:40:19,680 --> 00:40:28,080 most part, until you actually need it. So what  do we do? Well, one solution is Filecoin.   
412 00:40:28,080 --> 00:40:35,120 We're using a tool called Estuary. Estuary is one  of several clients for the Filecoin network.   413 00:40:35,120 --> 00:40:44,640 What it does is essentially manage storage  deals within Filecoin. The basic Filecoin   414 00:40:44,640 --> 00:40:52,000 primitive is a "deal." And it is made between you  as the client and any number of storage providers.   415 00:40:52,000 --> 00:41:00,320 Here is a global map of the storage providers  online currently, and at the end of the day   416 00:41:00,320 --> 00:41:04,880 I care about where they're located, but because  of the promises of the network and the protocol,   417 00:41:04,880 --> 00:41:12,880 I actually don't care who I'm talking to, exactly,  because the storage integrity is a protocol-level   418 00:41:12,880 --> 00:41:22,080 primitive. Here we have a 3x replication across  some files and we can take a look here.   419 00:41:22,080 --> 00:41:28,720 The bright green is fully online, and some of  these others had actually shown storage faults   420 00:41:28,720 --> 00:41:36,880 and the Estuary system has now gone ahead and  recreated these additional replicas. They're now   421 00:41:37,600 --> 00:41:42,720 in the process known as sealing. And we  can take a look. Here we have a provider -- 422 00:41:45,680 --> 00:41:49,760 the provider, I don't actually know  much about them -- but we can take a look.   423 00:41:51,440 --> 00:41:58,240 Here they are, and this is a replica  that we have in Montreal, so that's great.   424 00:41:58,960 --> 00:42:04,720 I'd like to make a very quick note here, which is that "coin" might make you think   425 00:42:04,720 --> 00:42:11,920 of energy usage, of danger to the environment,  and that is a very reasonable concern.   426 00:42:11,920 --> 00:42:17,280 The important thing to realize is that it  does not use the wasteful proof-of-work mechanism   427 00:42:17,280 --> 00:42:23,280 of Bitcoin. 
The actual ongoing data  verification that happens at the protocol level   428 00:42:23,280 --> 00:42:28,800 also ensures the integrity of the network. You can  read more about it here at this link, and you can   429 00:42:28,800 --> 00:42:35,920 look at the voluntary energy disclosures at  https://filecoin.energy/. Of course, there are many   430 00:42:35,920 --> 00:42:44,880 other systems that attempt to solve these  problems as well. There's IPFS Cluster, which   431 00:42:45,520 --> 00:42:52,080 is a sort of collaborative backup  solution. There's Textile, which is another   432 00:42:52,080 --> 00:42:59,280 Filecoin client tool. There's Storj,  which will be right up next. There's Arweave,   433 00:42:59,280 --> 00:43:06,640 which aims to achieve long-term or potentially  infinite storage with a finite upfront cost,   434 00:43:06,640 --> 00:43:12,480 which is initially an economic experiment, and many  others. You can try this out yourself at   435 00:43:12,480 --> 00:43:19,040 estuary.tech, which is a public Estuary node that's  already hosting hundreds of terabytes of data   436 00:43:19,680 --> 00:43:25,964 for its users. I think that's it. Thank you. 437 00:43:25,964 --> 00:43:33,520 Thank you so much, Arkadiy. You'll be hanging out with us later if people have more questions, and  you could probably answer some questions right   438 00:43:33,520 --> 00:43:39,040 there in the chat. And we're going to come back  to questions with you and Dominick, so let's move   439 00:43:39,040 --> 00:43:44,640 on though to our second demonstration. I'd like to  introduce you to Dominick Marino. He's the senior   440 00:43:44,640 --> 00:43:51,600 solutions architect at Storj. Storj is probably  the oldest decentralized storage company out there,   441 00:43:51,600 --> 00:43:58,080 and with Storj, the Internet Archive has been  working to store LibriVox audio books at scale.   
442 00:43:58,080 --> 00:44:02,070 so here to show us how that work is going please  welcome Dominick Marino 443 00:44:02,749 --> 00:44:04,759 Thank you so much Wendy. 444 00:44:04,759 --> 00:44:09,920 very excited to be here speaking with  everyone today i'm Dominic, Solution   445 00:44:09,920 --> 00:44:15,520 Architect at Storj and we're one of the  leading providers of decentralized storage.  446 00:44:17,280 --> 00:44:22,240 We're very proud of our track record  over the last, oh goodness, it's been   447 00:44:22,240 --> 00:44:28,160 about eight years since Sean Wilkinson  founded us in in 2014 in his dorm room.   448 00:44:29,440 --> 00:44:35,120 We've been really excited to work with the  internet archive decentralizing the LibriVox   449 00:44:35,120 --> 00:44:42,800 audio book series. It's a collection of over 16,000 titles and approximately 22 terabytes of data.  450 00:44:44,800 --> 00:44:51,040 I've worked very closely with Arkadiy and  have had a great time learning with him as we   451 00:44:51,040 --> 00:44:56,960 grow this at scale, bringing these  massive collections into storage. I'm   452 00:44:56,960 --> 00:45:02,160 happy to show what we've done today. The first  thing I'm going to do is tell you what we've done,   453 00:45:02,160 --> 00:45:06,480 and then I'm going to show you how we did it, give  you an explanation of how our network functions. 454 00:45:09,360 --> 00:45:15,760 Over at Storj, we're a decentralized storage  provider with over 13,000 nodes on our network,   455 00:45:15,760 --> 00:45:22,560 of which over 9,000 are independent node operators.  When you upload a file into our ecosystem   456 00:45:23,280 --> 00:45:28,880 You encrypt it, then you split it, and then you  distribute it out to those tens of thousands of   457 00:45:28,880 --> 00:45:35,920 nodes. This gives you -- ultimately, the consumer -- the  control and allows you to remain, if you choose to,   458 00:45:35,920 --> 00:45:43,280 the custodian of the private key. 
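A toy Python sketch of the "split it, then distribute it" step, to show why a file can survive a lost piece. This is not the Storj implementation: a single XOR parity piece stands in for the much stronger erasure codes that production networks use, the encryption step is omitted because the standard library has no suitable cipher, and `split_with_parity` and `recover` are hypothetical helpers.

```python
from functools import reduce
from typing import List, Optional

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def split_with_parity(data: bytes, k: int) -> List[bytes]:
    """Split data into k equal-size pieces plus one XOR parity piece."""
    piece_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(piece_len * k, b"\0")
    pieces = [padded[i * piece_len:(i + 1) * piece_len] for i in range(k)]
    parity = reduce(xor_bytes, pieces)
    return pieces + [parity]

def recover(pieces: List[Optional[bytes]], original_len: int) -> bytes:
    """Rebuild the data when at most one piece (data or parity) is missing."""
    missing = [i for i, p in enumerate(pieces) if p is None]
    if len(missing) > 1:
        raise ValueError("XOR parity can repair only one missing piece")
    repaired = list(pieces)
    if missing:
        present = [p for p in pieces if p is not None]
        # XOR of all surviving pieces equals the missing one.
        repaired[missing[0]] = reduce(xor_bytes, present)
    return b"".join(repaired[:-1])[:original_len]

data = b"LibriVox audiobook bytes"
pieces = split_with_parity(data, 4)
```

Each piece could be handed to a different node. In this toy scheme the loss of any one node is recoverable, and real erasure codes tolerate the loss of many pieces at once.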
We do in-full  disclosure work in both the web 2 and web 3 space   459 00:45:43,280 --> 00:45:50,400 We're engaged on a daily basis in web 3 related  activity, projects in the space, as well as   460 00:45:50,400 --> 00:45:58,080 offering edge services that allow organizations  in the web 2 space to benefit from the inherent   461 00:45:58,080 --> 00:46:04,000 benefits of the web3 space meaning you can  have a product today that uses something like   462 00:46:04,000 --> 00:46:11,200 amazon's s3 storage and you can benefit from the  resiliency, the redundancy, the performance, the   463 00:46:11,200 --> 00:46:16,960 value that decentralized storage brings you still  in the web 2 space. We're focused in   464 00:46:18,000 --> 00:46:23,680 pushing forward, in being forward-leaning, but  still being able to have a very usable service   465 00:46:24,400 --> 00:46:30,800 by all different sorts of orgs. I'm going to  jump right into a quick demo and show you   466 00:46:30,800 --> 00:46:36,560 some things we've accomplished as well  as a very simple way to use our product.   467 00:46:37,360 --> 00:46:44,080 To do that, I'm going to go through and  do a quick demo of uploading a file here.   468 00:46:44,080 --> 00:46:50,720 The first thing I'm going to do is pop over into  our product, go to our bucket. This is not the way   469 00:46:50,720 --> 00:46:55,680 you need to interact with our network but it's  a way you can interact with our network. So today   470 00:46:55,680 --> 00:47:00,160 I'm just going to go into this bucket and  I'm going to put in a super secure passphrase.   471 00:47:01,440 --> 00:47:05,120 I'm going to understand that I need to remember  that passphrase, because i'm the custodian of it   472 00:47:05,120 --> 00:47:12,640 and this service will not remember it. I'm now  in the bucket and I'm going to upload that file.   
473 00:47:13,680 --> 00:47:20,320 When that file uploads, I'm then going to  create a sharer link, paste that share link in,   474 00:47:21,920 --> 00:47:27,360 and view it. Now this is an edge service  we're running that allows you to share out   475 00:47:27,360 --> 00:47:34,880 items to anyone you wish. I'm just going to  post a link so Heather can post that link for you   476 00:47:35,920 --> 00:47:42,480 and you can load this link, but this is -- and it's  hard to see -- on 80 different, there's 80 different   477 00:47:42,480 --> 00:47:46,480 pieces so you can see the distribution around the globe of those pieces   478 00:47:47,440 --> 00:47:52,560 and it's that easy. To show you what we've  accomplished with the Internet Archive   479 00:47:53,440 --> 00:47:58,560 I'm going to go through the  root of their site. I'm going to pop into the book   480 00:47:58,560 --> 00:48:03,179 collection and then I'm going to go to their most  popular book, The Art of War.    481 00:48:03,179 --> 00:48:10,240 The Art of War, for all of us that haven't recently read it or  are unfamiliar with it, it's a book about   482 00:48:10,240 --> 00:48:18,320 avoiding war. War is failure. This  is about taking diplomatic ties to dispersing. 483 00:48:20,480 --> 00:48:27,360 With the Internet Archive, we've uploaded  these 16,000 plus assets, and thanks to Arkadiy, 484 00:48:29,520 --> 00:48:37,200 you can see that all assets related to  this asset are available over at Storj   485 00:48:37,200 --> 00:48:39,920 and soon will be available at the Internet Archive   486 00:48:40,800 --> 00:48:45,360 So you can see how they're using us. It's a  programmatic interaction. They're able to   487 00:48:45,360 --> 00:48:50,640 batch upload. You can see how easy it  is to just use our our simple web UI   488 00:48:51,360 --> 00:48:58,480 to go through and upload an object and share it  and that is backed by a decentralized network.   
489 00:48:59,040 --> 00:49:04,400 I'm going to hop back to the presentation and  then hop over to the next slide, which is a summary   490 00:49:05,600 --> 00:49:10,560 of what we've accomplished, a summary that you can  stream on the right here; you can see what it looks   491 00:49:10,560 --> 00:49:15,680 like. We've made a mock in the center as well  as the list of items on the right hand side.   492 00:49:15,680 --> 00:49:20,400 I'm just going to jump to a final slide and  cover a few more things about the network.   493 00:49:20,400 --> 00:49:25,520 What we're really excited about at Storj  is that we're given the creative freedom to   494 00:49:25,520 --> 00:49:31,440 produce what we need to be successful. That  is, to build what people want. When we're   495 00:49:31,440 --> 00:49:35,440 talking about things like IPFS and, Heather,  I'm going to send you another link to share,   496 00:49:36,800 --> 00:49:44,320 this same image that we just uploaded has  also been shared via an IPFS hash. I sent you   497 00:49:44,320 --> 00:49:49,920 a link to be embedded in the chat.   Now the ipfsdemo.devnet.storg.io   498 00:49:50,640 --> 00:49:57,600 showing that not only is our storage decentralized  but content addressable as well. That's something   499 00:49:57,600 --> 00:50:03,840 that's not in production today but it's coming  very soon. It's just so fantastic to be in an org   500 00:50:03,840 --> 00:50:09,520 that provides so much opportunity to build great  things for tomorrow. As far as a little bit more   501 00:50:09,520 --> 00:50:16,480 detail, I see a question on how does distribution  occur. We had a PhD economist actually build   502 00:50:16,480 --> 00:50:22,320 the model, so all of the nodes on our  network -- we don't run those nodes by the way, those   503 00:50:22,320 --> 00:50:27,040 are people who come in and choose to run them -- are  incentivized to be good actors on the network. 
We   504 00:50:27,040 --> 00:50:33,280 can't trust that they will be, of course, so we have  an audit and repair process that continually runs.   505 00:50:33,840 --> 00:50:40,320 That audit and repair process means that if  a node drops off the network, or a node is   506 00:50:40,320 --> 00:50:46,400 misbehaving in the network, or I know it's simply  just performing poorly, the power is out, we can   507 00:50:46,400 --> 00:50:53,840 address that. We will manage all repair. We  manage all audit. There's no need to negotiate,   508 00:50:54,880 --> 00:51:00,080 for instance, and have maybe inconsistent  pricing. You pay one price and all of that   509 00:51:00,080 --> 00:51:06,880 is done behind a service level agreement, SLA, a  contract where we guarantee a level of service   510 00:51:06,880 --> 00:51:14,880 to you. We are a product today that you can  use in your production application. You can get   511 00:51:14,880 --> 00:51:22,160 the benefits of that global distribution if you  want to be distributed, yet have data sovereignty.   512 00:51:22,160 --> 00:51:29,200 We do that as well. If you, for instance, are  trying to seek GDPR compliance, you want to be   513 00:51:29,200 --> 00:51:34,640 decentralized in Europe, you don't want the data  anywhere else but the European union, no problem.   514 00:51:35,360 --> 00:51:41,680 Conversely, if you're doing that in the United  States for a reason or Canada, no problem. We're   515 00:51:41,680 --> 00:51:47,120 the only provider giving you  that decentralized storage solution with native   516 00:51:47,120 --> 00:51:55,600 sovereignty, highly usable decentralized  storage with multiple on-ramps, making it easy.   517 00:51:56,160 --> 00:52:02,640 As you've seen, for the Internet Archive to  decentralize that large catalog of audiobooks,   518 00:52:05,440 --> 00:52:09,360 it's truly wonderful. and  I'm very fortunate to be here.   
519 00:52:10,320 --> 00:52:15,200 Wendy, with that I'm going to wrap, and we  can take care of the rest of course in Q&A.   520 00:52:16,880 --> 00:52:23,200 Great! Thank you so much. Let's call Arkadiy  and Dominick back and we'll stop sharing   521 00:52:23,200 --> 00:52:32,240 the screen and answer a few of your questions. So  one of the questions is, how can you really prove   522 00:52:32,240 --> 00:52:38,160 back from that hash that the originator  did not fake the location?   523 00:52:39,440 --> 00:52:44,240 I guess, how do we know that the hash  is really trustworthy? 524 00:52:44,240 --> 00:52:52,800 As per Jonathan Dotan's presentation, you're very much right to ask that. So the hash   525 00:52:52,800 --> 00:52:59,600 is only as trustworthy as the context of its  creation, so obviously we end up with a certain   526 00:52:59,600 --> 00:53:06,560 need to establish a root of trust. One  way that I can see this working out is if you can   527 00:53:06,560 --> 00:53:14,320 imagine the Internet Archive as a catalog and as  a data store. Right now you need both places 528 00:53:18,640 --> 00:53:25,600 -- "archival bond," I'm actually not familiar with this term -- so imagine the catalog and the   529 00:53:25,600 --> 00:53:33,520 data store as separate ideas. If you  trust the catalog, you don't necessarily have to   530 00:53:33,520 --> 00:53:41,120 trust the data store, as long as they are linked  through a cryptographically secure hash, so you   531 00:53:41,120 --> 00:53:47,760 can imagine, for example, a censorship-resistant  Internet Archive where you only need to ensure   532 00:53:47,760 --> 00:53:54,240 the integrity and transmission of the catalog  portion. And then the data can be retrieved from   533 00:53:54,240 --> 00:54:00,810 any number of decentralized networks underlying  it. And that relationship is trust. 
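The catalog-and-data-store idea described in this answer can be illustrated with a minimal sketch in Python. It is not the Internet Archive's or Storj's actual implementation. It assumes SHA-256 as the cryptographically secure hash, a trusted catalog that maps item names to hashes, and untrusted data stores modeled as plain dictionaries. All names here (`content_hash`, `fetch_verified`, `audiobook-001`) are hypothetical.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Return the SHA-256 digest of the bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def fetch_verified(item_id, catalog, stores):
    """Return the first copy whose hash matches the trusted catalog entry.

    `catalog` is trusted; `stores` are not. A store that returns altered
    bytes is skipped, because its hash will not match the catalog.
    """
    expected = catalog[item_id]
    for store in stores:
        data = store.get(item_id)
        if data is not None and content_hash(data) == expected:
            return data
    raise LookupError(f"no store returned valid bytes for {item_id!r}")


# Hypothetical example data: one trusted catalog entry, two untrusted stores.
original = b"original audio bytes"
catalog = {"audiobook-001": content_hash(original)}
tampered_store = {"audiobook-001": b"altered audio bytes"}
honest_store = {"audiobook-001": original}

# The tampered copy is rejected and the honest copy is returned.
recovered = fetch_verified("audiobook-001", catalog, [tampered_store, honest_store])
```

Only the catalog needs to be protected in this sketch. The bytes can come from any store, because a copy that does not hash to the catalog value is discarded.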
534 00:54:00,810 --> 00:54:06,000 So here's a question about decentralized storage working  with digital preservation: how does it handle, for   535 00:54:06,000 --> 00:54:08,777 instance, file obsolescence? 536 00:54:08,777 --> 00:54:13,324 File obsolescence. So let's dig a little bit deeper into that. 537 00:54:13,324 --> 00:54:18,281 That is the concept of a file not being necessary after a period of time. 538 00:54:18,281 --> 00:54:24,160 Is that how we want to think about it? Or are we thinking about the concept of maybe like bitrot? 539 00:54:24,160 --> 00:54:33,520 I would guess the first and maybe Dina can help us there. Let's say you don't need this file anymore; it's defunct.   540 00:54:33,520 --> 00:54:40,960 How easy is it to get rid of files, take  them down? In other words, is it like Glacier   541 00:54:40,960 --> 00:54:47,600 where it's really hard to move things around, or  can you call and change things kind of at will? 542 00:54:47,600 --> 00:54:53,840 I can dive into that. So at Storj we say  it's hot storage for the price of cold.   543 00:54:53,840 --> 00:54:58,960 You don't have to deal with the tiers. There is no,  for instance, auto-tiering or lower tier layer. We   544 00:54:58,960 --> 00:55:05,680 won't let you erasure code to non-ideal ratios.  We just do it. That being said, however you   545 00:55:05,680 --> 00:55:12,000 handle your file management, and this is true for  all storage backends, will be how you manage the   546 00:55:12,640 --> 00:55:20,480 archival and potential deletion of assets after a  period of time. The storage backends generally   547 00:55:20,480 --> 00:55:25,200 wouldn't be responsible for that; it would be  the data management layer, the front end, that   548 00:55:25,200 --> 00:55:27,546 is managing that archive. 
549 00:55:27,546 --> 00:55:32,293 Well, with that, I think we are going to wrap up this session and 550 00:55:32,293 --> 00:55:37,418 ask everyone to join us at the next session for more to and fro. 551 00:55:37,418 --> 00:55:42,920 We hope you did enjoy what you heard today and it was just a taste, a beginning. 552 00:55:42,920 --> 00:55:49,107 So please come back for more. We're doing this for six months, the last Thursday of every month, 553 00:55:49,107 --> 00:55:56,589 this is number two, at 1 pm Pacific / 4 pm Eastern. We have four more sessions here, three of them. 554 00:55:56,589 --> 00:55:58,880 The next one in March is on decentralized identity. 555 00:55:59,520 --> 00:56:06,240 And we've also, as mentioned, developed this  really beautiful resource guide with different   556 00:56:07,200 --> 00:56:13,840 videos, links to other companies that do this, other  organizations, deeper dive reading. So please   557 00:56:13,840 --> 00:56:19,040 take a look. We're dropping the link to that in the  chat. It will be emailed to you if you registered for this.  558 00:56:19,040 --> 00:56:29,840 And please share it widely. That's why  it's there. Finally, I just want to say thank you.